https://dx.doi.org/10.1088/1475-7516/2023/02/014

Summary

  • Gaussian processes (GPs) have two components: a mean function and a kernel function.
  • Often the focus is put on the kernel function, and the mean function is assumed to be zero.
  • An arbitrary choice of mean function can lead to unphysical results.
  • Example: calculation of distance moduli of Type Ia supernovae under two cosmological models.
  • Marginalise over a family of mean functions to remove the effect of mean function choice.

GP Notation

For data \(\{(\boldsymbol{x}_i, y_i); i = 1, \ldots, N\}\) with \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\),

\[y_i = f(\boldsymbol{x}_i) + \varepsilon_i\]

where \(\varepsilon_i\) is observational noise. We model the underlying function as a draw from a Gaussian process,

\[f(\cdot) \sim \mathcal{GP}\left( \mu(\boldsymbol{x};\boldsymbol\theta_\mu), k(\boldsymbol{x}, \boldsymbol{x'}; \boldsymbol{\theta}_k) \right)\]

where \(\mu(\cdot)\) is the mean function and \(k(\cdot)\) is the covariance kernel function, with hyperparameters \(\boldsymbol\theta_\mu\) and \(\boldsymbol\theta_k\), respectively.

Mean Function

If we were to take many realisations of a GP, their pointwise average over the support would recover the specified mean function.

For example, \(\mu(\boldsymbol{x}) = \boldsymbol{0}\), \(k(x, x') = \exp\{-\|x - x'\|^2\}\).
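As an illustrative sketch (not from the paper; the input grid and function names are assumptions), drawing realisations from this prior with NumPy shows the pointwise average converging to the zero mean function:

```python
import numpy as np

# Illustrative sketch: realisations from a GP prior with zero mean and
# the unit squared-exponential kernel given above.

def sq_exp_kernel(x1, x2):
    """k(x, x') = exp{-(x - x')^2} for 1-d inputs."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2))

x = np.linspace(0.0, 10.0, 200)                   # assumed input grid
K = sq_exp_kernel(x, x) + 1e-10 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=1000)

# The pointwise average of the realisations approaches the zero mean function.
print(np.abs(samples.mean(axis=0)).max())         # small, shrinking with more samples
```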

GP Regression

\[ \begin{bmatrix} \boldsymbol{y}\\ \boldsymbol{f^*} \end{bmatrix} \sim \mathcal{MVN} \left( \begin{bmatrix} \boldsymbol{\mu}\\ \boldsymbol{\mu^*} \end{bmatrix}, \begin{bmatrix} k(\boldsymbol{x}, \boldsymbol{x}) + C & k(\boldsymbol{x}, \boldsymbol{x^*})\\ k(\boldsymbol{x^*}, \boldsymbol{x}) & k(\boldsymbol{x^*}, \boldsymbol{x^*}) \end{bmatrix} \right) \]

where \(\boldsymbol{x}\) are observed points whose values are \(\boldsymbol{y}\), \(\boldsymbol{x^*}\) are target points with predicted values \(\boldsymbol{f^*}\), and \(C\) is the covariance matrix of the observational noise.

The reconstruction \(\boldsymbol{f^*}\) depends on the choice of \(\boldsymbol{\mu^*}\) and \(k(\cdot)\), which themselves have hyperparameters \(\boldsymbol{\theta}\).

Conditional Distribution

We actually want the reconstruction conditioned on observations, \(\boldsymbol{y}\).

\[\boldsymbol{f^*} \vert \; \boldsymbol{y} \sim \mathcal{MVN}(\boldsymbol{\bar{f}^*}, \mathrm{Cov}(\boldsymbol{f^*}))\]

with

\[\boldsymbol{\bar{f}^*} = \boldsymbol{\mu^*} + k(\boldsymbol{x^*}, \boldsymbol{x})\left[ k(\boldsymbol{x}, \boldsymbol{x}) + C \right]^{-1} (\boldsymbol{y} - \boldsymbol{\mu})\]

\[\mathrm{Cov}(\boldsymbol{f^*}) = k(\boldsymbol{x^*}, \boldsymbol{x^*}) - k(\boldsymbol{x^*}, \boldsymbol{x})\left[ k(\boldsymbol{x}, \boldsymbol{x}) + C \right]^{-1} k(\boldsymbol{x}, \boldsymbol{x^*})\]

Far from the data, \(k(\boldsymbol{x^*}, \boldsymbol{x}) \to 0\) and the reconstruction reverts to \(\boldsymbol{\mu^*}\), which is precisely why the choice of mean function matters.
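A minimal sketch of these expressions (hypothetical helper; 1-d inputs, any kernel with signature `kernel(x1, x2)`, and a diagonal noise covariance \(C = \sigma_n^2 I\) are assumptions):

```python
import numpy as np

# Sketch of the conditional mean and covariance above.

def gp_posterior(x_obs, y_obs, x_star, mu, mu_star, kernel, sigma_n):
    K = kernel(x_obs, x_obs) + sigma_n**2 * np.eye(len(x_obs))  # k(x, x) + C
    K_s = kernel(x_star, x_obs)                                 # k(x*, x)
    K_ss = kernel(x_star, x_star)                               # k(x*, x*)
    alpha = np.linalg.solve(K, y_obs - mu)                      # [k(x,x)+C]^{-1}(y - mu)
    f_bar = mu_star + K_s @ alpha                               # conditional mean
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)                # conditional covariance
    return f_bar, cov
```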

Chain Rule & Marginalisation

Joint probabilities expressed as conditional probabilities:

\[P(A, B) = P(A|B) \times P(B)\]

\[P(A,B | C) = P(A|B,C) \times P(B|C)\]

Variables can be “integrated out”:

\[\begin{aligned} P(X) &= \int P(X, Y = y)\, \mathrm{d}y\\ &= \int P(X \vert Y = y) \times P(Y = y)\, \mathrm{d}y \end{aligned}\]

GP Regression

The marginal likelihood of the observations, obtained by integrating out the latent function \(\boldsymbol{f}\):

\[p(\boldsymbol{y} | \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) = \int p(\boldsymbol{y}|\boldsymbol{f}, \boldsymbol{x}) \times p(\boldsymbol{f}|\boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) \;d\boldsymbol{f}\]

We can choose values for \(\mu\) and \(\theta\) that maximise this quantity to get the “best fit” Gaussian process.

For a Gaussian likelihood this integral has a closed form:

\[\log p(\boldsymbol{y} \vert \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) = -\frac{1}{2}(\boldsymbol{y} - \boldsymbol{\mu})^{\mathsf{T}} K_y^{-1} (\boldsymbol{y} - \boldsymbol{\mu}) - \frac{1}{2}\log \vert K_y \vert - \frac{N}{2}\log 2\pi\]

where \(K_y = k(\boldsymbol{x}, \boldsymbol{x}) + C\).
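A sketch of this optimisation under illustrative assumptions (zero mean function, squared-exponential kernel, known noise level; `x`, `y`, and `sigma_n` are placeholders):

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of type-II maximum likelihood: optimise the closed-form log marginal
# likelihood over the squared-exponential hyperparameters (sigma_f, ell),
# assuming a zero mean and a diagonal noise covariance C = sigma_n^2 I.

def neg_log_marginal_likelihood(log_params, x, y, sigma_n):
    sigma_f, ell = np.exp(log_params)            # optimise in log space
    K = sigma_f**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)
    K_y = K + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K_y)                  # stable |K_y| and K_y^{-1}
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log p(y|x, theta) = 1/2 y^T K_y^{-1} y + 1/2 log|K_y| + N/2 log 2pi
    return (0.5 * y @ alpha + np.log(np.diag(L)).sum()
            + 0.5 * len(y) * np.log(2.0 * np.pi))

# Example usage with placeholder data arrays x and y:
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2),
#                   args=(x, y, 0.1), method="L-BFGS-B")
# sigma_f_hat, ell_hat = np.exp(result.x)
```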

Data

  • The Pantheon SNIa compilation contains 1048 points up to \(z \simeq 2.3\).
  • Mock data based on Pantheon with added Gaussian white noise.
  • Two models for comparison
    • Phenomenologically emergent dark energy (PEDE) model, and
    • Chevallier-Polarski-Linder (CPL) parameterisation of the DE equation of state.

Method

Apply GP regression with

  1. Different mean functions
    • Zero
    • “Best fit”
  2. Different kernel functions
    • Squared Exponential
    • Matern 3/2
  3. Marginalise over different mean functions

Results: Zero mean function

Results: Marginal Likelihood

\[k(x, x'; \sigma_f, \ell) = \sigma_f^2 \exp\left\{-\frac{1}{2}\left( \frac{x - x'}{\ell}\right)^2\right\}\qquad \sigma_f, \ell > 0\]
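For reference, minimal implementations of this kernel and the Matern 3/2 kernel from the Method section, in their standard forms (1-d inputs assumed; function names are illustrative):

```python
import numpy as np

# Minimal implementations of the two kernels compared in the paper.

def squared_exponential(x1, x2, sigma_f, ell):
    r = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

def matern_32(x1, x2, sigma_f, ell):
    # k(r) = sigma_f^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell), r = |x - x'|
    s = np.sqrt(3.0) * np.abs(x1[:, None] - x2[None, :]) / ell
    return sigma_f**2 * (1.0 + s) * np.exp(-s)
```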

A Choice

  1. Find the “best” hyperparameter settings by optimising the marginal likelihood. \[p(\boldsymbol{y} | \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) = \int p(\boldsymbol{y}|\boldsymbol{f}, \boldsymbol{x}) \times p(\boldsymbol{f}|\boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k) \;d\boldsymbol{f}\]

  2. Marginalise over all values of the hyperparameters, under a prior \(p(\boldsymbol\theta_k)\), to remove their effect. \[p(\boldsymbol{y} \vert \boldsymbol{x}, \boldsymbol{\mu}) = \iint p(\boldsymbol{y} \vert \boldsymbol{f}, \boldsymbol{x})\, p(\boldsymbol{f}\vert \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta}_k)\, p(\boldsymbol{\theta}_k)\, \mathrm{d}\boldsymbol{f}\,\mathrm{d}\boldsymbol\theta_k\]

Results: Full-marginalisation

\[p(\boldsymbol{y} \vert \boldsymbol{x}) = \iiint p(\boldsymbol{y} \vert \boldsymbol{f}, \boldsymbol{x})\, p(\boldsymbol{f}\vert \boldsymbol{x}, \boldsymbol{\mu}, \boldsymbol{\theta})\, p(\boldsymbol{\mu})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{f}\,\mathrm{d}\boldsymbol\theta\,\mathrm{d}\boldsymbol\mu\]
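One simple way to approximate this nested integral, sketched under illustrative assumptions (a linear mean family \(\mu(x; a, b) = a + bx\) and independent priors chosen here purely for demonstration), is Monte Carlo averaging of the closed-form marginal likelihood over prior draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch: Monte Carlo estimate of the fully marginalised evidence
# p(y|x) = E_{mu,theta}[ p(y|x, mu, theta) ], using the closed-form Gaussian
# marginal likelihood inside the expectation. Priors and the linear mean
# family mu(x; a, b) = a + b*x are illustrative assumptions.

def log_marginal(x, y, mu, sigma_f, ell, sigma_n):
    K_y = (sigma_f**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)
           + sigma_n**2 * np.eye(len(x)))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
    return -(0.5 * (y - mu) @ alpha + np.log(np.diag(L)).sum()
             + 0.5 * len(y) * np.log(2.0 * np.pi))

def evidence(x, y, sigma_n, n_samples=2000):
    logps = []
    for _ in range(n_samples):
        a, b = rng.normal(0.0, 10.0, size=2)    # prior over mean-function params
        sigma_f = rng.uniform(0.1, 10.0)        # prior over kernel amplitude
        ell = rng.uniform(0.1, 5.0)             # prior over length scale
        logps.append(log_marginal(x, y, a + b * x, sigma_f, ell, sigma_n))
    logps = np.array(logps)
    m = logps.max()                             # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logps - m)))
```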

Different kernels

Bias

Conclusions of Paper

  • The mean function should not be chosen arbitrarily; it should be carefully selected.
  • Don't use a zero mean function for non-stationary quantities.
  • Full marginalisation does not seem to be too sensitive to the choice of family of mean functions.
  • The difference between the Squared Exponential and Matern kernels is negligible.

Comments

  • Unsurprising but reassuring result.
  • Marginalisation is common practice, but model misspecification is still a risk.
  • There is a computational cost, especially for more complex families of functions.
    • use GPs for the mean function?!
  • Alternative ways to prevent “unphysical” reconstructions, e.g., constraints on priors, transformations, etc.
  • Comparing derivatives is very challenging.